Introduction to Blaze

In this tutorial we'll learn how to use Blaze to discover, migrate, and query data living in other databases. Generally this tutorial will have the following format

  1. odo - Move data to database
  2. blaze - Query data in database

Goal: Accessible, Interactive, Analytic Queries

NumPy and Pandas provide accessible, interactive, analytic queries; this is valuable.


In [1]:
import pandas as pd
df = pd.read_csv('data/iris.csv')
df.head()


Out[1]:
Unnamed: 0 SepalLength SepalWidth PetalLength PetalWidth Species
0 1 5.1 3.5 1.4 0.2 setosa
1 2 4.9 3.0 1.4 0.2 setosa
2 3 4.7 3.2 1.3 0.2 setosa
3 4 4.6 3.1 1.5 0.2 setosa
4 5 5.0 3.6 1.4 0.2 setosa

In [2]:
df.groupby(df.Species).PetalLength.mean()  # Average petal length per species


Out[2]:
Species
setosa        1.462
versicolor    4.260
virginica     5.552
Name: PetalLength, dtype: float64

But as data grows and systems become more complex, moving data and querying data become more difficult. Python already has excellent tools for data that fits in memory, but we want to hook up to data that is inconvenient.

From now on, we're going to assume one of the following:

  1. You have an inconvenient amount of data
  2. That data should live someplace other than your computer

Databases and Python

When in-memory arrays/dataframes cease to be an option, we turn to databases. These live outside of the Python process and so might be less convenient. The open source Python ecosystem includes libraries to interact with these databases and with foreign data in general.

Examples:

Today we're going to use some of these indirectly with odo (was into) and Blaze. We'll try to point out these libraries as we automate them so that, if you'd like, you can use them independently.


odo (formerly into)

Odo migrates data between formats and locations.

Before we can use a database we need to move data into it. The odo project provides a single consistent interface to move data between formats and between locations.

We'll start with local data and eventually move out to remote data.

odo docs


Examples

Odo moves data into a target from a source

>>> odo(source, target)

The target and source can be either a Python object or a string URI. The following are all valid calls to into

>>> odo('iris.csv', pd.DataFrame)  # Load CSV file into new DataFrame
>>> odo(my_df, 'iris.json')        # Write DataFrame into JSON file
>>> odo('iris.csv', 'iris.json')   # Migrate data from CSV to JSON

Exercise

Use odo to load the iris.csv file into a Python list, a np.ndarray, and a pd.DataFrame


In [4]:
from odo import odo
import numpy as np
import pandas as pd

In [5]:
odo("data/iris.csv", pd.DataFrame, index_col=0)


Out[5]:
SepalLength SepalWidth PetalLength PetalWidth Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
6 4.6 3.4 1.4 0.3 setosa
7 5.0 3.4 1.5 0.2 setosa
8 4.4 2.9 1.4 0.2 setosa
9 4.9 3.1 1.5 0.1 setosa
10 5.4 3.7 1.5 0.2 setosa
11 4.8 3.4 1.6 0.2 setosa
12 4.8 3.0 1.4 0.1 setosa
13 4.3 3.0 1.1 0.1 setosa
14 5.8 4.0 1.2 0.2 setosa
15 5.7 4.4 1.5 0.4 setosa
16 5.4 3.9 1.3 0.4 setosa
17 5.1 3.5 1.4 0.3 setosa
18 5.7 3.8 1.7 0.3 setosa
19 5.1 3.8 1.5 0.3 setosa
20 5.4 3.4 1.7 0.2 setosa
21 5.1 3.7 1.5 0.4 setosa
22 4.6 3.6 1.0 0.2 setosa
23 5.1 3.3 1.7 0.5 setosa
24 4.8 3.4 1.9 0.2 setosa
25 5.0 3.0 1.6 0.2 setosa
26 5.0 3.4 1.6 0.4 setosa
27 5.2 3.5 1.5 0.2 setosa
28 5.2 3.4 1.4 0.2 setosa
29 4.7 3.2 1.6 0.2 setosa
... ... ... ... ... ...
120 6.9 3.2 5.7 2.3 virginica
121 5.6 2.8 4.9 2.0 virginica
122 7.7 2.8 6.7 2.0 virginica
123 6.3 2.7 4.9 1.8 virginica
124 6.7 3.3 5.7 2.1 virginica
125 7.2 3.2 6.0 1.8 virginica
126 6.2 2.8 4.8 1.8 virginica
127 6.1 3.0 4.9 1.8 virginica
128 6.4 2.8 5.6 2.1 virginica
129 7.2 3.0 5.8 1.6 virginica
130 7.4 2.8 6.1 1.9 virginica
131 7.9 3.8 6.4 2.0 virginica
132 6.4 2.8 5.6 2.2 virginica
133 6.3 2.8 5.1 1.5 virginica
134 6.1 2.6 5.6 1.4 virginica
135 7.7 3.0 6.1 2.3 virginica
136 6.3 3.4 5.6 2.4 virginica
137 6.4 3.1 5.5 1.8 virginica
138 6.0 3.0 4.8 1.8 virginica
139 6.9 3.1 5.4 2.1 virginica
140 6.7 3.1 5.6 2.4 virginica
141 6.9 3.1 5.1 2.3 virginica
142 5.8 2.7 5.1 1.9 virginica
143 6.8 3.2 5.9 2.3 virginica
144 6.7 3.3 5.7 2.5 virginica
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

150 rows × 5 columns


In [5]:
odo("data/iris.csv", list)


Out[5]:
[(1, 5.1, 3.5, 1.4, 0.2, u'setosa'),
 (2, 4.9, 3.0, 1.4, 0.2, u'setosa'),
 (3, 4.7, 3.2, 1.3, 0.2, u'setosa'),
 (4, 4.6, 3.1, 1.5, 0.2, u'setosa'),
 (5, 5.0, 3.6, 1.4, 0.2, u'setosa'),
 (6, 5.4, 3.9, 1.7, 0.4, u'setosa'),
 (7, 4.6, 3.4, 1.4, 0.3, u'setosa'),
 (8, 5.0, 3.4, 1.5, 0.2, u'setosa'),
 (9, 4.4, 2.9, 1.4, 0.2, u'setosa'),
 (10, 4.9, 3.1, 1.5, 0.1, u'setosa'),
 (11, 5.4, 3.7, 1.5, 0.2, u'setosa'),
 (12, 4.8, 3.4, 1.6, 0.2, u'setosa'),
 (13, 4.8, 3.0, 1.4, 0.1, u'setosa'),
 (14, 4.3, 3.0, 1.1, 0.1, u'setosa'),
 (15, 5.8, 4.0, 1.2, 0.2, u'setosa'),
 (16, 5.7, 4.4, 1.5, 0.4, u'setosa'),
 (17, 5.4, 3.9, 1.3, 0.4, u'setosa'),
 (18, 5.1, 3.5, 1.4, 0.3, u'setosa'),
 (19, 5.7, 3.8, 1.7, 0.3, u'setosa'),
 (20, 5.1, 3.8, 1.5, 0.3, u'setosa'),
 (21, 5.4, 3.4, 1.7, 0.2, u'setosa'),
 (22, 5.1, 3.7, 1.5, 0.4, u'setosa'),
 (23, 4.6, 3.6, 1.0, 0.2, u'setosa'),
 (24, 5.1, 3.3, 1.7, 0.5, u'setosa'),
 (25, 4.8, 3.4, 1.9, 0.2, u'setosa'),
 (26, 5.0, 3.0, 1.6, 0.2, u'setosa'),
 (27, 5.0, 3.4, 1.6, 0.4, u'setosa'),
 (28, 5.2, 3.5, 1.5, 0.2, u'setosa'),
 (29, 5.2, 3.4, 1.4, 0.2, u'setosa'),
 (30, 4.7, 3.2, 1.6, 0.2, u'setosa'),
 (31, 4.8, 3.1, 1.6, 0.2, u'setosa'),
 (32, 5.4, 3.4, 1.5, 0.4, u'setosa'),
 (33, 5.2, 4.1, 1.5, 0.1, u'setosa'),
 (34, 5.5, 4.2, 1.4, 0.2, u'setosa'),
 (35, 4.9, 3.1, 1.5, 0.2, u'setosa'),
 (36, 5.0, 3.2, 1.2, 0.2, u'setosa'),
 (37, 5.5, 3.5, 1.3, 0.2, u'setosa'),
 (38, 4.9, 3.6, 1.4, 0.1, u'setosa'),
 (39, 4.4, 3.0, 1.3, 0.2, u'setosa'),
 (40, 5.1, 3.4, 1.5, 0.2, u'setosa'),
 (41, 5.0, 3.5, 1.3, 0.3, u'setosa'),
 (42, 4.5, 2.3, 1.3, 0.3, u'setosa'),
 (43, 4.4, 3.2, 1.3, 0.2, u'setosa'),
 (44, 5.0, 3.5, 1.6, 0.6, u'setosa'),
 (45, 5.1, 3.8, 1.9, 0.4, u'setosa'),
 (46, 4.8, 3.0, 1.4, 0.3, u'setosa'),
 (47, 5.1, 3.8, 1.6, 0.2, u'setosa'),
 (48, 4.6, 3.2, 1.4, 0.2, u'setosa'),
 (49, 5.3, 3.7, 1.5, 0.2, u'setosa'),
 (50, 5.0, 3.3, 1.4, 0.2, u'setosa'),
 (51, 7.0, 3.2, 4.7, 1.4, u'versicolor'),
 (52, 6.4, 3.2, 4.5, 1.5, u'versicolor'),
 (53, 6.9, 3.1, 4.9, 1.5, u'versicolor'),
 (54, 5.5, 2.3, 4.0, 1.3, u'versicolor'),
 (55, 6.5, 2.8, 4.6, 1.5, u'versicolor'),
 (56, 5.7, 2.8, 4.5, 1.3, u'versicolor'),
 (57, 6.3, 3.3, 4.7, 1.6, u'versicolor'),
 (58, 4.9, 2.4, 3.3, 1.0, u'versicolor'),
 (59, 6.6, 2.9, 4.6, 1.3, u'versicolor'),
 (60, 5.2, 2.7, 3.9, 1.4, u'versicolor'),
 (61, 5.0, 2.0, 3.5, 1.0, u'versicolor'),
 (62, 5.9, 3.0, 4.2, 1.5, u'versicolor'),
 (63, 6.0, 2.2, 4.0, 1.0, u'versicolor'),
 (64, 6.1, 2.9, 4.7, 1.4, u'versicolor'),
 (65, 5.6, 2.9, 3.6, 1.3, u'versicolor'),
 (66, 6.7, 3.1, 4.4, 1.4, u'versicolor'),
 (67, 5.6, 3.0, 4.5, 1.5, u'versicolor'),
 (68, 5.8, 2.7, 4.1, 1.0, u'versicolor'),
 (69, 6.2, 2.2, 4.5, 1.5, u'versicolor'),
 (70, 5.6, 2.5, 3.9, 1.1, u'versicolor'),
 (71, 5.9, 3.2, 4.8, 1.8, u'versicolor'),
 (72, 6.1, 2.8, 4.0, 1.3, u'versicolor'),
 (73, 6.3, 2.5, 4.9, 1.5, u'versicolor'),
 (74, 6.1, 2.8, 4.7, 1.2, u'versicolor'),
 (75, 6.4, 2.9, 4.3, 1.3, u'versicolor'),
 (76, 6.6, 3.0, 4.4, 1.4, u'versicolor'),
 (77, 6.8, 2.8, 4.8, 1.4, u'versicolor'),
 (78, 6.7, 3.0, 5.0, 1.7, u'versicolor'),
 (79, 6.0, 2.9, 4.5, 1.5, u'versicolor'),
 (80, 5.7, 2.6, 3.5, 1.0, u'versicolor'),
 (81, 5.5, 2.4, 3.8, 1.1, u'versicolor'),
 (82, 5.5, 2.4, 3.7, 1.0, u'versicolor'),
 (83, 5.8, 2.7, 3.9, 1.2, u'versicolor'),
 (84, 6.0, 2.7, 5.1, 1.6, u'versicolor'),
 (85, 5.4, 3.0, 4.5, 1.5, u'versicolor'),
 (86, 6.0, 3.4, 4.5, 1.6, u'versicolor'),
 (87, 6.7, 3.1, 4.7, 1.5, u'versicolor'),
 (88, 6.3, 2.3, 4.4, 1.3, u'versicolor'),
 (89, 5.6, 3.0, 4.1, 1.3, u'versicolor'),
 (90, 5.5, 2.5, 4.0, 1.3, u'versicolor'),
 (91, 5.5, 2.6, 4.4, 1.2, u'versicolor'),
 (92, 6.1, 3.0, 4.6, 1.4, u'versicolor'),
 (93, 5.8, 2.6, 4.0, 1.2, u'versicolor'),
 (94, 5.0, 2.3, 3.3, 1.0, u'versicolor'),
 (95, 5.6, 2.7, 4.2, 1.3, u'versicolor'),
 (96, 5.7, 3.0, 4.2, 1.2, u'versicolor'),
 (97, 5.7, 2.9, 4.2, 1.3, u'versicolor'),
 (98, 6.2, 2.9, 4.3, 1.3, u'versicolor'),
 (99, 5.1, 2.5, 3.0, 1.1, u'versicolor'),
 (100, 5.7, 2.8, 4.1, 1.3, u'versicolor'),
 (101, 6.3, 3.3, 6.0, 2.5, u'virginica'),
 (102, 5.8, 2.7, 5.1, 1.9, u'virginica'),
 (103, 7.1, 3.0, 5.9, 2.1, u'virginica'),
 (104, 6.3, 2.9, 5.6, 1.8, u'virginica'),
 (105, 6.5, 3.0, 5.8, 2.2, u'virginica'),
 (106, 7.6, 3.0, 6.6, 2.1, u'virginica'),
 (107, 4.9, 2.5, 4.5, 1.7, u'virginica'),
 (108, 7.3, 2.9, 6.3, 1.8, u'virginica'),
 (109, 6.7, 2.5, 5.8, 1.8, u'virginica'),
 (110, 7.2, 3.6, 6.1, 2.5, u'virginica'),
 (111, 6.5, 3.2, 5.1, 2.0, u'virginica'),
 (112, 6.4, 2.7, 5.3, 1.9, u'virginica'),
 (113, 6.8, 3.0, 5.5, 2.1, u'virginica'),
 (114, 5.7, 2.5, 5.0, 2.0, u'virginica'),
 (115, 5.8, 2.8, 5.1, 2.4, u'virginica'),
 (116, 6.4, 3.2, 5.3, 2.3, u'virginica'),
 (117, 6.5, 3.0, 5.5, 1.8, u'virginica'),
 (118, 7.7, 3.8, 6.7, 2.2, u'virginica'),
 (119, 7.7, 2.6, 6.9, 2.3, u'virginica'),
 (120, 6.0, 2.2, 5.0, 1.5, u'virginica'),
 (121, 6.9, 3.2, 5.7, 2.3, u'virginica'),
 (122, 5.6, 2.8, 4.9, 2.0, u'virginica'),
 (123, 7.7, 2.8, 6.7, 2.0, u'virginica'),
 (124, 6.3, 2.7, 4.9, 1.8, u'virginica'),
 (125, 6.7, 3.3, 5.7, 2.1, u'virginica'),
 (126, 7.2, 3.2, 6.0, 1.8, u'virginica'),
 (127, 6.2, 2.8, 4.8, 1.8, u'virginica'),
 (128, 6.1, 3.0, 4.9, 1.8, u'virginica'),
 (129, 6.4, 2.8, 5.6, 2.1, u'virginica'),
 (130, 7.2, 3.0, 5.8, 1.6, u'virginica'),
 (131, 7.4, 2.8, 6.1, 1.9, u'virginica'),
 (132, 7.9, 3.8, 6.4, 2.0, u'virginica'),
 (133, 6.4, 2.8, 5.6, 2.2, u'virginica'),
 (134, 6.3, 2.8, 5.1, 1.5, u'virginica'),
 (135, 6.1, 2.6, 5.6, 1.4, u'virginica'),
 (136, 7.7, 3.0, 6.1, 2.3, u'virginica'),
 (137, 6.3, 3.4, 5.6, 2.4, u'virginica'),
 (138, 6.4, 3.1, 5.5, 1.8, u'virginica'),
 (139, 6.0, 3.0, 4.8, 1.8, u'virginica'),
 (140, 6.9, 3.1, 5.4, 2.1, u'virginica'),
 (141, 6.7, 3.1, 5.6, 2.4, u'virginica'),
 (142, 6.9, 3.1, 5.1, 2.3, u'virginica'),
 (143, 5.8, 2.7, 5.1, 1.9, u'virginica'),
 (144, 6.8, 3.2, 5.9, 2.3, u'virginica'),
 (145, 6.7, 3.3, 5.7, 2.5, u'virginica'),
 (146, 6.7, 3.0, 5.2, 2.3, u'virginica'),
 (147, 6.3, 2.5, 5.0, 1.9, u'virginica'),
 (148, 6.5, 3.0, 5.2, 2.0, u'virginica'),
 (149, 6.2, 3.4, 5.4, 2.3, u'virginica'),
 (150, 5.9, 3.0, 5.1, 1.8, u'virginica')]

In [6]:
odo("data/iris.csv", np.ndarray)


Out[6]:
rec.array([(1, 5.1, 3.5, 1.4, 0.2, u'setosa'),
       (2, 4.9, 3.0, 1.4, 0.2, u'setosa'),
       (3, 4.7, 3.2, 1.3, 0.2, u'setosa'),
       (4, 4.6, 3.1, 1.5, 0.2, u'setosa'),
       (5, 5.0, 3.6, 1.4, 0.2, u'setosa'),
       (6, 5.4, 3.9, 1.7, 0.4, u'setosa'),
       (7, 4.6, 3.4, 1.4, 0.3, u'setosa'),
       (8, 5.0, 3.4, 1.5, 0.2, u'setosa'),
       (9, 4.4, 2.9, 1.4, 0.2, u'setosa'),
       (10, 4.9, 3.1, 1.5, 0.1, u'setosa'),
       (11, 5.4, 3.7, 1.5, 0.2, u'setosa'),
       (12, 4.8, 3.4, 1.6, 0.2, u'setosa'),
       (13, 4.8, 3.0, 1.4, 0.1, u'setosa'),
       (14, 4.3, 3.0, 1.1, 0.1, u'setosa'),
       (15, 5.8, 4.0, 1.2, 0.2, u'setosa'),
       (16, 5.7, 4.4, 1.5, 0.4, u'setosa'),
       (17, 5.4, 3.9, 1.3, 0.4, u'setosa'),
       (18, 5.1, 3.5, 1.4, 0.3, u'setosa'),
       (19, 5.7, 3.8, 1.7, 0.3, u'setosa'),
       (20, 5.1, 3.8, 1.5, 0.3, u'setosa'),
       (21, 5.4, 3.4, 1.7, 0.2, u'setosa'),
       (22, 5.1, 3.7, 1.5, 0.4, u'setosa'),
       (23, 4.6, 3.6, 1.0, 0.2, u'setosa'),
       (24, 5.1, 3.3, 1.7, 0.5, u'setosa'),
       (25, 4.8, 3.4, 1.9, 0.2, u'setosa'),
       (26, 5.0, 3.0, 1.6, 0.2, u'setosa'),
       (27, 5.0, 3.4, 1.6, 0.4, u'setosa'),
       (28, 5.2, 3.5, 1.5, 0.2, u'setosa'),
       (29, 5.2, 3.4, 1.4, 0.2, u'setosa'),
       (30, 4.7, 3.2, 1.6, 0.2, u'setosa'),
       (31, 4.8, 3.1, 1.6, 0.2, u'setosa'),
       (32, 5.4, 3.4, 1.5, 0.4, u'setosa'),
       (33, 5.2, 4.1, 1.5, 0.1, u'setosa'),
       (34, 5.5, 4.2, 1.4, 0.2, u'setosa'),
       (35, 4.9, 3.1, 1.5, 0.2, u'setosa'),
       (36, 5.0, 3.2, 1.2, 0.2, u'setosa'),
       (37, 5.5, 3.5, 1.3, 0.2, u'setosa'),
       (38, 4.9, 3.6, 1.4, 0.1, u'setosa'),
       (39, 4.4, 3.0, 1.3, 0.2, u'setosa'),
       (40, 5.1, 3.4, 1.5, 0.2, u'setosa'),
       (41, 5.0, 3.5, 1.3, 0.3, u'setosa'),
       (42, 4.5, 2.3, 1.3, 0.3, u'setosa'),
       (43, 4.4, 3.2, 1.3, 0.2, u'setosa'),
       (44, 5.0, 3.5, 1.6, 0.6, u'setosa'),
       (45, 5.1, 3.8, 1.9, 0.4, u'setosa'),
       (46, 4.8, 3.0, 1.4, 0.3, u'setosa'),
       (47, 5.1, 3.8, 1.6, 0.2, u'setosa'),
       (48, 4.6, 3.2, 1.4, 0.2, u'setosa'),
       (49, 5.3, 3.7, 1.5, 0.2, u'setosa'),
       (50, 5.0, 3.3, 1.4, 0.2, u'setosa'),
       (51, 7.0, 3.2, 4.7, 1.4, u'versicolor'),
       (52, 6.4, 3.2, 4.5, 1.5, u'versicolor'),
       (53, 6.9, 3.1, 4.9, 1.5, u'versicolor'),
       (54, 5.5, 2.3, 4.0, 1.3, u'versicolor'),
       (55, 6.5, 2.8, 4.6, 1.5, u'versicolor'),
       (56, 5.7, 2.8, 4.5, 1.3, u'versicolor'),
       (57, 6.3, 3.3, 4.7, 1.6, u'versicolor'),
       (58, 4.9, 2.4, 3.3, 1.0, u'versicolor'),
       (59, 6.6, 2.9, 4.6, 1.3, u'versicolor'),
       (60, 5.2, 2.7, 3.9, 1.4, u'versicolor'),
       (61, 5.0, 2.0, 3.5, 1.0, u'versicolor'),
       (62, 5.9, 3.0, 4.2, 1.5, u'versicolor'),
       (63, 6.0, 2.2, 4.0, 1.0, u'versicolor'),
       (64, 6.1, 2.9, 4.7, 1.4, u'versicolor'),
       (65, 5.6, 2.9, 3.6, 1.3, u'versicolor'),
       (66, 6.7, 3.1, 4.4, 1.4, u'versicolor'),
       (67, 5.6, 3.0, 4.5, 1.5, u'versicolor'),
       (68, 5.8, 2.7, 4.1, 1.0, u'versicolor'),
       (69, 6.2, 2.2, 4.5, 1.5, u'versicolor'),
       (70, 5.6, 2.5, 3.9, 1.1, u'versicolor'),
       (71, 5.9, 3.2, 4.8, 1.8, u'versicolor'),
       (72, 6.1, 2.8, 4.0, 1.3, u'versicolor'),
       (73, 6.3, 2.5, 4.9, 1.5, u'versicolor'),
       (74, 6.1, 2.8, 4.7, 1.2, u'versicolor'),
       (75, 6.4, 2.9, 4.3, 1.3, u'versicolor'),
       (76, 6.6, 3.0, 4.4, 1.4, u'versicolor'),
       (77, 6.8, 2.8, 4.8, 1.4, u'versicolor'),
       (78, 6.7, 3.0, 5.0, 1.7, u'versicolor'),
       (79, 6.0, 2.9, 4.5, 1.5, u'versicolor'),
       (80, 5.7, 2.6, 3.5, 1.0, u'versicolor'),
       (81, 5.5, 2.4, 3.8, 1.1, u'versicolor'),
       (82, 5.5, 2.4, 3.7, 1.0, u'versicolor'),
       (83, 5.8, 2.7, 3.9, 1.2, u'versicolor'),
       (84, 6.0, 2.7, 5.1, 1.6, u'versicolor'),
       (85, 5.4, 3.0, 4.5, 1.5, u'versicolor'),
       (86, 6.0, 3.4, 4.5, 1.6, u'versicolor'),
       (87, 6.7, 3.1, 4.7, 1.5, u'versicolor'),
       (88, 6.3, 2.3, 4.4, 1.3, u'versicolor'),
       (89, 5.6, 3.0, 4.1, 1.3, u'versicolor'),
       (90, 5.5, 2.5, 4.0, 1.3, u'versicolor'),
       (91, 5.5, 2.6, 4.4, 1.2, u'versicolor'),
       (92, 6.1, 3.0, 4.6, 1.4, u'versicolor'),
       (93, 5.8, 2.6, 4.0, 1.2, u'versicolor'),
       (94, 5.0, 2.3, 3.3, 1.0, u'versicolor'),
       (95, 5.6, 2.7, 4.2, 1.3, u'versicolor'),
       (96, 5.7, 3.0, 4.2, 1.2, u'versicolor'),
       (97, 5.7, 2.9, 4.2, 1.3, u'versicolor'),
       (98, 6.2, 2.9, 4.3, 1.3, u'versicolor'),
       (99, 5.1, 2.5, 3.0, 1.1, u'versicolor'),
       (100, 5.7, 2.8, 4.1, 1.3, u'versicolor'),
       (101, 6.3, 3.3, 6.0, 2.5, u'virginica'),
       (102, 5.8, 2.7, 5.1, 1.9, u'virginica'),
       (103, 7.1, 3.0, 5.9, 2.1, u'virginica'),
       (104, 6.3, 2.9, 5.6, 1.8, u'virginica'),
       (105, 6.5, 3.0, 5.8, 2.2, u'virginica'),
       (106, 7.6, 3.0, 6.6, 2.1, u'virginica'),
       (107, 4.9, 2.5, 4.5, 1.7, u'virginica'),
       (108, 7.3, 2.9, 6.3, 1.8, u'virginica'),
       (109, 6.7, 2.5, 5.8, 1.8, u'virginica'),
       (110, 7.2, 3.6, 6.1, 2.5, u'virginica'),
       (111, 6.5, 3.2, 5.1, 2.0, u'virginica'),
       (112, 6.4, 2.7, 5.3, 1.9, u'virginica'),
       (113, 6.8, 3.0, 5.5, 2.1, u'virginica'),
       (114, 5.7, 2.5, 5.0, 2.0, u'virginica'),
       (115, 5.8, 2.8, 5.1, 2.4, u'virginica'),
       (116, 6.4, 3.2, 5.3, 2.3, u'virginica'),
       (117, 6.5, 3.0, 5.5, 1.8, u'virginica'),
       (118, 7.7, 3.8, 6.7, 2.2, u'virginica'),
       (119, 7.7, 2.6, 6.9, 2.3, u'virginica'),
       (120, 6.0, 2.2, 5.0, 1.5, u'virginica'),
       (121, 6.9, 3.2, 5.7, 2.3, u'virginica'),
       (122, 5.6, 2.8, 4.9, 2.0, u'virginica'),
       (123, 7.7, 2.8, 6.7, 2.0, u'virginica'),
       (124, 6.3, 2.7, 4.9, 1.8, u'virginica'),
       (125, 6.7, 3.3, 5.7, 2.1, u'virginica'),
       (126, 7.2, 3.2, 6.0, 1.8, u'virginica'),
       (127, 6.2, 2.8, 4.8, 1.8, u'virginica'),
       (128, 6.1, 3.0, 4.9, 1.8, u'virginica'),
       (129, 6.4, 2.8, 5.6, 2.1, u'virginica'),
       (130, 7.2, 3.0, 5.8, 1.6, u'virginica'),
       (131, 7.4, 2.8, 6.1, 1.9, u'virginica'),
       (132, 7.9, 3.8, 6.4, 2.0, u'virginica'),
       (133, 6.4, 2.8, 5.6, 2.2, u'virginica'),
       (134, 6.3, 2.8, 5.1, 1.5, u'virginica'),
       (135, 6.1, 2.6, 5.6, 1.4, u'virginica'),
       (136, 7.7, 3.0, 6.1, 2.3, u'virginica'),
       (137, 6.3, 3.4, 5.6, 2.4, u'virginica'),
       (138, 6.4, 3.1, 5.5, 1.8, u'virginica'),
       (139, 6.0, 3.0, 4.8, 1.8, u'virginica'),
       (140, 6.9, 3.1, 5.4, 2.1, u'virginica'),
       (141, 6.7, 3.1, 5.6, 2.4, u'virginica'),
       (142, 6.9, 3.1, 5.1, 2.3, u'virginica'),
       (143, 5.8, 2.7, 5.1, 1.9, u'virginica'),
       (144, 6.8, 3.2, 5.9, 2.3, u'virginica'),
       (145, 6.7, 3.3, 5.7, 2.5, u'virginica'),
       (146, 6.7, 3.0, 5.2, 2.3, u'virginica'),
       (147, 6.3, 2.5, 5.0, 1.9, u'virginica'),
       (148, 6.5, 3.0, 5.2, 2.0, u'virginica'),
       (149, 6.2, 3.4, 5.4, 2.3, u'virginica'),
       (150, 5.9, 3.0, 5.1, 1.8, u'virginica')], 
      dtype=[('Unnamed: 0', '<i8'), ('SepalLength', '<f8'), ('SepalWidth', '<f8'), ('PetalLength', '<f8'), ('PetalWidth', '<f8'), ('Species', 'O')])

URI Strings

Odo refers to foreign data either with a Python object like a sqlalchemy.Table object for a SQL table, or with a string URI, like postgresql://hostname::tablename.

URI's often take on the following form

protocol://path-to-resource::path-within-resource

Where path-to-resource might point to a file, a database hostname, etc. while path-within-resource might refer to a datapath or table name. Note the two main separators

  • :// separates the protocol on the left (sqlite, mongodb, ssh, hdfs, hive, ...)
  • :: separates the path within the database on the right (e.g. tablename)

odo docs on uri strings


Examples

Here are some example URIs

myfile.json
myfiles.*.csv'
postgresql://hostname::tablename
mongodb://hostname/db::collection
ssh://user@host:/path/to/myfile.csv
hdfs://user@host:/path/to/*.csv

Exercise

Migrate your CSV file into a table named iris in a new SQLite database at sqlite:///my.db. Remember to use the :: separator and to separate your database name from your table name.

odo docs on SQL


In [7]:
odo("data/iris.csv", "sqlite:///my.db::iris")


Out[7]:
Table('iris', MetaData(bind=Engine(sqlite:///my.db)), Column('Unnamed: 0', BigInteger(), table=<iris>, nullable=False), Column('SepalLength', Float(precision=53), table=<iris>), Column('SepalWidth', Float(precision=53), table=<iris>), Column('PetalLength', Float(precision=53), table=<iris>), Column('PetalWidth', Float(precision=53), table=<iris>), Column('Species', Text(), table=<iris>), schema=None)

What kind of object did you get receive as output? Call type on your result.


In [8]:
type(_)


Out[8]:
sqlalchemy.sql.schema.Table

How it works

Odo is a network of fast pairwise conversions between pairs of formats. We when we migrate between two formats we traverse a path of pairwise conversions.

We visualize that network below:

Each node represents a data format. Each directed edge represents a function to transform data between two formats. A single call to into may traverse multiple edges and multiple intermediate formats. Red nodes support larger-than-memory data.

A single call to into may traverse several intermediate formats calling on several conversion functions. For example, we when migrate a CSV file to a Mongo database we might take the following route:

  • Load in to a DataFrame (pandas.read_csv)
  • Convert to np.recarray (DataFrame.to_records)
  • Then to a Python Iterator (np.ndarray.tolist)
  • Finally to Mongo (pymongo.Collection.insert)

Alternatively we could write a special function that uses MongoDB's native CSV loader and shortcut this entire process with a direct edge CSV -> Mongo.

These functions are chosen because they are fast, often far faster than converting through a central serialization format.

This picture is actually from an older version of odo, when the graph was still small enough to visualize pleasantly. See odo docs for a more updated version.


Remote Data

We can interact with remote data in three locations

  1. On Amazon's S3 (this will be quick)
  2. On a remote machine via ssh
  3. On the Hadoop File System (HDFS)

For most of this we'll wait until we've seen Blaze, briefly we'll use S3.

S3

For now, we quickly grab a file from Amazon's S3.

This example depends on boto to interact with S3.

conda install boto

odo docs on aws


In [10]:
odo('s3://nyqpug/tips.csv', pd.DataFrame)


Out[10]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
5 25.29 4.71 Male No Sun Dinner 4
6 8.77 2.00 Male No Sun Dinner 2
7 26.88 3.12 Male No Sun Dinner 4
8 15.04 1.96 Male No Sun Dinner 2
9 14.78 3.23 Male No Sun Dinner 2
10 10.27 1.71 Male No Sun Dinner 2
11 35.26 5.00 Female No Sun Dinner 4
12 15.42 1.57 Male No Sun Dinner 2
13 18.43 3.00 Male No Sun Dinner 4
14 14.83 3.02 Female No Sun Dinner 2
15 21.58 3.92 Male No Sun Dinner 2
16 10.33 1.67 Female No Sun Dinner 3
17 16.29 3.71 Male No Sun Dinner 3
18 16.97 3.50 Female No Sun Dinner 3
19 20.65 3.35 Male No Sat Dinner 3
20 17.92 4.08 Male No Sat Dinner 2
21 20.29 2.75 Female No Sat Dinner 2
22 15.77 2.23 Female No Sat Dinner 2
23 39.42 7.58 Male No Sat Dinner 4
24 19.82 3.18 Male No Sat Dinner 2
25 17.81 2.34 Male No Sat Dinner 4
26 13.37 2.00 Male No Sat Dinner 2
27 12.69 2.00 Male No Sat Dinner 2
28 21.70 4.30 Male No Sat Dinner 2
29 19.65 3.00 Female No Sat Dinner 2
... ... ... ... ... ... ... ...
214 28.17 6.50 Female Yes Sat Dinner 3
215 12.90 1.10 Female Yes Sat Dinner 2
216 28.15 3.00 Male Yes Sat Dinner 5
217 11.59 1.50 Male Yes Sat Dinner 2
218 7.74 1.44 Male Yes Sat Dinner 2
219 30.14 3.09 Female Yes Sat Dinner 4
220 12.16 2.20 Male Yes Fri Lunch 2
221 13.42 3.48 Female Yes Fri Lunch 2
222 8.58 1.92 Male Yes Fri Lunch 1
223 15.98 3.00 Female No Fri Lunch 3
224 13.42 1.58 Male Yes Fri Lunch 2
225 16.27 2.50 Female Yes Fri Lunch 2
226 10.09 2.00 Female Yes Fri Lunch 2
227 20.45 3.00 Male No Sat Dinner 4
228 13.28 2.72 Male No Sat Dinner 2
229 22.12 2.88 Female Yes Sat Dinner 2
230 24.01 2.00 Male Yes Sat Dinner 4
231 15.69 3.00 Male Yes Sat Dinner 3
232 11.61 3.39 Male No Sat Dinner 2
233 10.77 1.47 Male No Sat Dinner 2
234 15.53 3.00 Male Yes Sat Dinner 2
235 10.07 1.25 Male No Sat Dinner 2
236 12.60 1.00 Male Yes Sat Dinner 2
237 32.83 1.17 Male Yes Sat Dinner 2
238 35.83 4.67 Female No Sat Dinner 3
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2

244 rows × 7 columns

Blaze

Blaze translates a subset of numpy/pandas syntax into database queries. It hides away the database.

On simple datasets, like CSV files, Blaze acts like Pandas with slightly different syntax. In this case Blaze is just using Pandas.


Pandas example


In [11]:
import pandas as pd

df = pd.read_csv('data/iris.csv')
df.head(5)


Out[11]:
Unnamed: 0 SepalLength SepalWidth PetalLength PetalWidth Species
0 1 5.1 3.5 1.4 0.2 setosa
1 2 4.9 3.0 1.4 0.2 setosa
2 3 4.7 3.2 1.3 0.2 setosa
3 4 4.6 3.1 1.5 0.2 setosa
4 5 5.0 3.6 1.4 0.2 setosa

In [12]:
df.Species.unique()


Out[12]:
array(['setosa', 'versicolor', 'virginica'], dtype=object)

In [12]:
df.Species.drop_duplicates()


Out[12]:
0          setosa
50     versicolor
100     virginica
Name: Species, dtype: object

Blaze example


In [13]:
import blaze as bz

d = bz.Data('data/iris.csv')
d.head(5)


Out[13]:
Unnamed: 0 SepalLength SepalWidth PetalLength PetalWidth Species
0 1 5.1 3.5 1.4 0.2 setosa
1 2 4.9 3.0 1.4 0.2 setosa
2 3 4.7 3.2 1.3 0.2 setosa
3 4 4.6 3.1 1.5 0.2 setosa
4 5 5.0 3.6 1.4 0.2 setosa

In [14]:
d.Species.distinct()


Out[14]:
Species
0 setosa
1 versicolor
2 virginica

Foreign Data

Blaze does different things under-the-hood on different kinds of data

  • CSV files: Pandas DataFrames (or iterators of DataFrames)
  • SQL tables: SQLAlchemy.
  • Mongo collections: PyMongo
  • ...

SQL

We'll play with SQL a lot during this tutorial. Blaze translates your query to SQLAlchemy. SQLAlchemy then translates to the SQL dialect of your database, your database then executes that query intelligently.

  • Blaze $\rightarrow$ SQLAlchemy $\rightarrow$ SQL $\rightarrow$ Database computation

This translation process lets analysts interact with a familiar interface while leveraging a potentially powerful database.

To keep things local we'll use SQLite, but this works with any database with a SQLAlchemy dialect. Examples in this section use the iris dataset. Exercises use the Lahman Baseball statistics database, year 2013.

If you have not downloaded this dataset you could do so here - https://github.com/jknecht/baseball-archive-sqlite/raw/master/lahman2013.sqlite.


Examples

Lets dive into Blaze Syntax. For simple queries it looks and feels similar to Pandas


In [15]:
db = bz.Data('sqlite:///my.db')
#db.iris
#db.iris.head()

In [16]:
db.iris.Species.distinct()


Out[16]:
Species
0 Species
1 setosa
2 versicolor
3 virginica

In [17]:
db.iris[db.iris.Species == 'versicolor'][['Species', 'SepalLength']]


Out[17]:
Species SepalLength
0 versicolor 7.0
1 versicolor 6.4
2 versicolor 6.9
3 versicolor 5.5
4 versicolor 6.5
5 versicolor 5.7
6 versicolor 6.3
7 versicolor 4.9
8 versicolor 6.6
9 versicolor 5.2
10 versicolor 5.0

Work happens on the database

If we were using pandas we would read the table into pandas, then use pandas' fast in-memory algorithms for computation. Here we translate your query into SQL and then send that query to the database to do the work.

  • Pandas $\leftarrow_\textrm{data}$ SQL, then Pandas computes
  • Blaze $\rightarrow_\textrm{query}$ SQL, then database computes

If we want to dive into the internal API we can inspect the query that Blaze transmits.



In [18]:
# Inspect SQL query
query = db.iris[db.iris.Species == 'versicolor'][['Species', 'SepalLength']]
print bz.compute(query)


SELECT iris."Species", iris."SepalLength" 
FROM iris 
WHERE iris."Species" = ?

In [19]:
query = bz.by(db.iris.Species, longest=db.iris.PetalLength.max(),
                              shortest=db.iris.PetalLength.min())
print bz.compute(query)


SELECT iris."Species", max(iris."PetalLength") AS longest, min(iris."PetalLength") AS shortest 
FROM iris GROUP BY iris."Species"

Exercises

Now we load the Lahman baseball database and perform similar queries


In [20]:
# db = bz.Data('postgresql://postgres:postgres@ec2-54-159-160-163.compute-1.amazonaws.com')  # Use Postgres if you don't have the sqlite file
db = bz.Data('sqlite:///data/lahman2013.sqlite')
db.dshape


Out[20]:
dshape("""{
  AllstarFull: var * {
    playerID: ?string,
    yearID: ?int32,
    gameNum: ?int32,
    gameID: ?string,
    teamID: ?string,
    lgID: ?string,
    GP: ?int32,
    startingPos: ?int32
    },
  Appearances: var * {
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    playerID: ?string,
    G_all: ?int32,
    GS: ?int32,
    G_batting: ?int32,
    G_defense: ?int32,
    G_p: ?int32,
    G_c: ?int32,
    G_1b: ?int32,
    G_2b: ?int32,
    G_3b: ?int32,
    G_ss: ?int32,
    G_lf: ?int32,
    G_cf: ?int32,
    G_rf: ?int32,
    G_of: ?int32,
    G_dh: ?int32,
    G_ph: ?int32,
    G_pr: ?int32
    },
  AwardsManagers: var * {
    playerID: ?string,
    awardID: ?string,
    yearID: ?int32,
    lgID: ?string,
    tie: ?string,
    notes: ?string
    },
  AwardsPlayers: var * {
    playerID: ?string,
    awardID: ?string,
    yearID: ?int32,
    lgID: ?string,
    tie: ?string,
    notes: ?string
    },
  AwardsShareManagers: var * {
    awardID: ?string,
    yearID: ?int32,
    lgID: ?string,
    playerID: ?string,
    pointsWon: ?int32,
    pointsMax: ?int32,
    votesFirst: ?int32
    },
  AwardsSharePlayers: var * {
    awardID: ?string,
    yearID: ?int32,
    lgID: ?string,
    playerID: ?string,
    pointsWon: ?float64,
    pointsMax: ?int32,
    votesFirst: ?float64
    },
  Batting: var * {
    playerID: ?string,
    yearID: ?int32,
    stint: ?int32,
    teamID: ?string,
    lgID: ?string,
    G: ?int32,
    G_batting: ?int32,
    AB: ?int32,
    R: ?int32,
    H: ?int32,
    2B: ?int32,
    3B: ?int32,
    HR: ?int32,
    RBI: ?int32,
    SB: ?int32,
    CS: ?int32,
    BB: ?int32,
    SO: ?int32,
    IBB: ?int32,
    HBP: ?int32,
    SH: ?int32,
    SF: ?int32,
    GIDP: ?int32,
    G_old: ?int32
    },
  BattingPost: var * {
    yearID: ?int32,
    round: ?string,
    playerID: ?string,
    teamID: ?string,
    lgID: ?string,
    G: ?int32,
    AB: ?int32,
    R: ?int32,
    H: ?int32,
    2B: ?int32,
    3B: ?int32,
    HR: ?int32,
    RBI: ?int32,
    SB: ?int32,
    CS: ?int32,
    BB: ?int32,
    SO: ?int32,
    IBB: ?int32,
    HBP: ?int32,
    SH: ?int32,
    SF: ?int32,
    GIDP: ?int32
    },
  Fielding: var * {
    playerID: ?string,
    yearID: ?int32,
    stint: ?int32,
    teamID: ?string,
    lgID: ?string,
    POS: ?string,
    G: ?int32,
    GS: ?int32,
    InnOuts: ?int32,
    PO: ?int32,
    A: ?int32,
    E: ?int32,
    DP: ?int32,
    PB: ?int32,
    WP: ?int32,
    SB: ?int32,
    CS: ?int32,
    ZR: ?float64
    },
  FieldingOF: var * {
    playerID: ?string,
    yearID: ?int32,
    stint: ?int32,
    Glf: ?int32,
    Gcf: ?int32,
    Grf: ?int32
    },
  FieldingPost: var * {
    playerID: ?string,
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    round: ?string,
    POS: ?string,
    G: ?int32,
    GS: ?int32,
    InnOuts: ?int32,
    PO: ?int32,
    A: ?int32,
    E: ?int32,
    DP: ?int32,
    TP: ?int32,
    PB: ?int32,
    SB: ?int32,
    CS: ?int32
    },
  HallOfFame: var * {
    playerID: ?string,
    yearid: ?int32,
    votedBy: ?string,
    ballots: ?int32,
    needed: ?int32,
    votes: ?int32,
    inducted: ?string,
    category: ?string,
    needed_note: ?string
    },
  Managers: var * {
    playerID: ?string,
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    inseason: ?int32,
    G: ?int32,
    W: ?int32,
    L: ?int32,
    rank: ?int32,
    plyrMgr: ?string
    },
  ManagersHalf: var * {
    playerID: ?string,
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    inseason: ?int32,
    half: ?int32,
    G: ?int32,
    W: ?int32,
    L: ?int32,
    rank: ?int32
    },
  Master: var * {
    playerID: ?string,
    birthYear: ?int32,
    birthMonth: ?int32,
    birthDay: ?int32,
    birthCountry: ?string,
    birthState: ?string,
    birthCity: ?string,
    deathYear: ?int32,
    deathMonth: ?int32,
    deathDay: ?int32,
    deathCountry: ?string,
    deathState: ?string,
    deathCity: ?string,
    nameFirst: ?string,
    nameLast: ?string,
    nameGiven: ?string,
    weight: ?int32,
    height: ?float64,
    bats: ?string,
    throws: ?string,
    debut: ?float64,
    finalGame: ?float64,
    retroID: ?string,
    bbrefID: ?string
    },
  Pitching: var * {
    playerID: ?string,
    yearID: ?int32,
    stint: ?int32,
    teamID: ?string,
    lgID: ?string,
    W: ?int32,
    L: ?int32,
    G: ?int32,
    GS: ?int32,
    CG: ?int32,
    SHO: ?int32,
    SV: ?int32,
    IPouts: ?int32,
    H: ?int32,
    ER: ?int32,
    HR: ?int32,
    BB: ?int32,
    SO: ?int32,
    BAOpp: ?float64,
    ERA: ?float64,
    IBB: ?int32,
    WP: ?int32,
    HBP: ?int32,
    BK: ?int32,
    BFP: ?int32,
    GF: ?int32,
    R: ?int32,
    SH: ?int32,
    SF: ?int32,
    GIDP: ?int32
    },
  PitchingPost: var * {
    playerID: ?string,
    yearID: ?int32,
    round: ?string,
    teamID: ?string,
    lgID: ?string,
    W: ?int32,
    L: ?int32,
    G: ?int32,
    GS: ?int32,
    CG: ?int32,
    SHO: ?int32,
    SV: ?int32,
    IPouts: ?int32,
    H: ?int32,
    ER: ?int32,
    HR: ?int32,
    BB: ?int32,
    SO: ?int32,
    BAOpp: ?float64,
    ERA: ?float64,
    IBB: ?int32,
    WP: ?int32,
    HBP: ?int32,
    BK: ?int32,
    BFP: ?int32,
    GF: ?int32,
    R: ?int32,
    SH: ?int32,
    SF: ?int32,
    GIDP: ?int32
    },
  Salaries: var * {
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    playerID: ?string,
    salary: ?float64
    },
  Schools: var * {
    schoolID: ?string,
    schoolName: ?string,
    schoolCity: ?string,
    schoolState: ?string,
    schoolNick: ?string
    },
  SchoolsPlayers: var * {
    playerID: ?string,
    schoolID: ?string,
    yearMin: ?int32,
    yearMax: ?int32
    },
  SeriesPost: var * {
    yearID: ?int32,
    round: ?string,
    teamIDwinner: ?string,
    lgIDwinner: ?string,
    teamIDloser: ?string,
    lgIDloser: ?string,
    wins: ?int32,
    losses: ?int32,
    ties: ?int32
    },
  Teams: var * {
    yearID: ?int32,
    lgID: ?string,
    teamID: ?string,
    franchID: ?string,
    divID: ?string,
    Rank: ?int32,
    G: ?int32,
    Ghome: ?int32,
    W: ?int32,
    L: ?int32,
    DivWin: ?string,
    WCWin: ?string,
    LgWin: ?string,
    WSWin: ?string,
    R: ?int32,
    AB: ?int32,
    H: ?int32,
    2B: ?int32,
    3B: ?int32,
    HR: ?int32,
    BB: ?int32,
    SO: ?int32,
    SB: ?int32,
    CS: ?int32,
    HBP: ?int32,
    SF: ?int32,
    RA: ?int32,
    ER: ?int32,
    ERA: ?float64,
    CG: ?int32,
    SHO: ?int32,
    SV: ?int32,
    IPouts: ?int32,
    HA: ?int32,
    HRA: ?int32,
    BBA: ?int32,
    SOA: ?int32,
    E: ?int32,
    DP: ?int32,
    FP: ?float64,
    name: ?string,
    park: ?string,
    attendance: ?int32,
    BPF: ?int32,
    PPF: ?int32,
    teamIDBR: ?string,
    teamIDlahman45: ?string,
    teamIDretro: ?string
    },
  TeamsFranchises: var * {
    franchID: ?string,
    franchName: ?string,
    active: ?string,
    NAassoc: ?string
    },
  TeamsHalf: var * {
    yearID: ?int32,
    lgID: ?string,
    teamID: ?string,
    Half: ?string,
    divID: ?string,
    DivWin: ?string,
    Rank: ?int32,
    G: ?int32,
    W: ?int32,
    L: ?int32
    },
  temp: var * {ID: ?int32, namefull: ?string, born: ?float64}
  }""")

In [21]:
# View the Salaries table
t = bz.Data('sqlite:///data/lahman2013.sqlite::Salaries')
t.dshape


Out[21]:
dshape("""var * {
  yearID: ?int32,
  teamID: ?string,
  lgID: ?string,
  playerID: ?string,
  salary: ?float64
  }""")

In [22]:
# What are the distinct teamIDs in the Salaries table?
t.teamID.distinct()


Out[22]:
teamID
0 ATL
1 BAL
2 BOS
3 CAL
4 CHA
5 CHN
6 CIN
7 CLE
8 DET
9 HOU
10 KCA

In [23]:
from odo import odo

odo(t.teamID.distinct(), list)


Out[23]:
[(u'ATL',),
 (u'BAL',),
 (u'BOS',),
 (u'CAL',),
 (u'CHA',),
 (u'CHN',),
 (u'CIN',),
 (u'CLE',),
 (u'DET',),
 (u'HOU',),
 (u'KCA',),
 (u'LAN',),
 (u'MIN',),
 (u'ML4',),
 (u'MON',),
 (u'NYA',),
 (u'NYN',),
 (u'OAK',),
 (u'PHI',),
 (u'PIT',),
 (u'SDN',),
 (u'SEA',),
 (u'SFN',),
 (u'SLN',),
 (u'TEX',),
 (u'TOR',),
 (u'COL',),
 (u'FLO',),
 (u'ANA',),
 (u'ARI',),
 (u'MIL',),
 (u'TBA',),
 (u'LAA',),
 (u'WAS',),
 (u'MIA',)]

In [24]:
query = t.teamID.distinct()
print bz.compute(query)


SELECT DISTINCT "Salaries"."teamID" AS "teamID" 
FROM "Salaries"

In [25]:
# What is the minimum and maximum yearID in the Salaries table?  
t.yearID.min()


Out[25]:
1985

In [26]:
t.yearID.max()


Out[26]:
2013

In [27]:
# For the Oakland Athletics (teamID OAK), pick out the playerID, salary, and yearID columns

t[t.teamID=='OAK'][['playerID', 'salary', 'yearID']]


Out[27]:
playerID salary yearID
0 atherke01 107333 1985
1 bakerdu01 575000 1985
2 birtsti01 60000 1985
3 bochtbr01 350000 1985
4 codirch01 141500 1985
5 collida02 700000 1985
6 davismi02 305000 1985
7 griffal01 610000 1985
8 heathmi02 425000 1985
9 hendest01 150000 1985
10 howelja01 262500 1985

In [28]:
oak = t[t.teamID=='OAK'][['playerID', 'salary', 'yearID']]

In [29]:
oak


Out[29]:
playerID salary yearID
0 atherke01 107333 1985
1 bakerdu01 575000 1985
2 birtsti01 60000 1985
3 bochtbr01 350000 1985
4 codirch01 141500 1985
5 collida02 700000 1985
6 davismi02 305000 1985
7 griffal01 610000 1985
8 heathmi02 425000 1985
9 hendest01 150000 1985
10 howelja01 262500 1985

In [30]:
odo(oak, 'oak.csv')


Out[30]:
<odo.backends.csv.CSV at 0x10f7eec90>

In [31]:
# Sort that result by salary.  
# Use the ascending=False keyword argument to the sort function to find the highest paid players

oak.sort('salary',ascending=False)


Out[31]:
playerID salary yearID
0 hollima01 13500000 2009
1 kendaja01 12858194 2007
2 chaveer01 12500000 2010
3 dyeje01 11666667 2003
4 dyeje01 11666667 2004
5 chaveer01 11500000 2008
6 chaveer01 11500000 2009
7 kendaja01 11492454 2006
8 kendaja01 10571429 2005
9 sheetbe01 10000000 2010
10 chaveer01 9500000 2006

Example: Split-apply-combine

In Pandas we perform computations on a per-group basis with the groupby operator. In Blaze our syntax is slightly different, using instead the by function.


In [32]:
import pandas as pd
iris = pd.read_csv('data/iris.csv')
iris.groupby('Species').PetalLength.min()


Out[32]:
Species
setosa        1.0
versicolor    3.0
virginica     4.5
Name: PetalLength, dtype: float64

In [33]:
iris = bz.Data('sqlite:///my.db::iris')
bz.by(iris.Species, largest=iris.PetalLength.max(), 
                    smallest=iris.PetalLength.min())
print(_)


Species
setosa        1.0
versicolor    3.0
virginica     4.5
Name: PetalLength, dtype: float64

Store Results

By default Blaze only shows us the first ten lines of a result. This provides a more interactive feel and stops us from accidentally crushing our system. Sometimes we do want to compute all of the results and store them someplace.

Blaze expressions are valid sources for odo. So we can store our results in any format.


Exercise: Storage

The solution to the first split-apply-combine problem is below. Store that result in a list, a CSV file, and in a new SQL table in our database (use a uri like sqlite://... to specify the SQL table.)


In [34]:
result = bz.by(db.Salaries.teamID, avg=db.Salaries.salary.mean(), 
                                   max=db.Salaries.salary.max(), 
                                   ratio=db.Salaries.salary.max() / db.Salaries.salary.min()
                ).sort('ratio', ascending=False)

In [35]:
odo(result, list)[:10]


Out[35]:
[(u'PHI', 2092230.9326364691, 25000000.0, 416.6666666666667),
 (u'LAN', 2346982.6980255516, 23854494.0, 397.5749),
 (u'NYN', 2317349.977245509, 23145011.0, 385.7501833333333),
 (u'DET', 1980834.9902080784, 23000000.0, 383.3333333333333),
 (u'MIN', 1525031.650385604, 23000000.0, 383.3333333333333),
 (u'BOS', 2692113.8568075118, 22500000.0, 375.0),
 (u'SFN', 2044198.6841463414, 22250000.0, 370.8333333333333),
 (u'TEX', 1874651.551143201, 22000000.0, 366.6666666666667),
 (u'SEA', 1932288.913202934, 20557143.0, 342.61905),
 (u'CHA', 1992653.5012315272, 17000000.0, 340.0)]

In [36]:
odo(result, 'sqlite:///my.db::result')


Out[36]:
Table('result', MetaData(bind=Engine(sqlite:///my.db)), Column('teamID', Text(), table=<result>), Column('avg', Float(precision=53), table=<result>, nullable=False), Column('max', Float(precision=53), table=<result>), Column('ratio', Float(precision=53), table=<result>), schema=None)